Research Trends in Genetic Disease: A Data Science Tutorial¶

By William Agan¶

Hello! In this tutorial project I'll be taking you through the basics of the entire data science pipeline. The topic we'll be looking at today is genetic disease, with articles collected from the National Library of Medicine, specifically PubMed. More on how to do that in the next section, where I'll walk you through the first part of the data science pipeline: Data Collection!

Imports¶

You may notice we have quite a few imports here, and that's for a good reason. Many of these provide specific functionality that we want later in the project, whether that be for visualization or NLP.

In [1]:
import json
import requests
import calendar
import numpy as np
import nltk
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

from xml.etree import ElementTree as ET
from datetime import datetime, timedelta
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report
from scipy.stats import chi2_contingency

Below we are just making sure that our natural language toolkit packages are updated

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\redra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\redra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[2]:
True

Data Collection / Curation¶

The functions below, search_articles and fetch_article_details work in tandem to access PubMeds e-utils API, first performing an e-search and fetching PMIDs all at once for articles based on our unique search criteria. PMIDs are unique identifiers for PubMed articles, and we use them in fetch_article_details to access a specific article. Since we need more than just the PMID that search_articles gives us for our analysis, we pass a list of PMIDs and our desired information to fetch_article_details. This includes the title of the article, publication date, and the MeSH keywords (short for Medical Subject Heading). You'll notice that fetch_article_details process the list of PMIDs in batches. This is done to avoid overwhelming the PubMed API with too many requests at once, which causes errors in the retrieval.

Data Management¶

You will also see the functions save_as_json and load_json_to_dataframe below. These are simple functions that are used to avoid redundant queries to the PubMed API. More on that below!

In [3]:
def search_articles(term, retmax, api_key, start_date, end_date):
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'pubmed',
        'term': term,
        'retmax': retmax,
        'retmode': 'xml',
        'api_key': api_key,
        'datetype': 'pdat',
        'mindate': start_date,
        'maxdate': end_date
    }
    response = requests.get(url, params=params)
    if response.status_code == 200:
        tree = ET.fromstring(response.content)
        pmid_list = [id_elem.text for id_elem in tree.findall(".//Id")]
        return pmid_list
    else:
        print(f"Error: Unable to search articles for term '{term}' on dates between {start_date} and {end_date}")
        return []

def fetch_article_details(pmid_list, api_key, batch_size=250):
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    articles = []
    for start in range(0, len(pmid_list), batch_size):
        end = start + batch_size
        batch_pmids = pmid_list[start:end]
        params = {
            'db': 'pubmed',
            'id': ','.join(batch_pmids),
            'retmode': 'xml',
            'api_key': api_key
        }
        response = requests.get(url, params=params)
        if response.status_code == 200:
            tree = ET.fromstring(response.content)
            for article in tree.findall(".//PubmedArticle"):
                pmid = article.find(".//PMID").text
                title = article.find(".//ArticleTitle").text
                pub_date_elements = article.find(".//PubDate")
                year = pub_date_elements.find("Year").text if pub_date_elements.find("Year") is not None else ""
                month = pub_date_elements.find("Month").text if pub_date_elements.find("Month") is not None else ""
                day = pub_date_elements.find("Day").text if pub_date_elements.find("Day") is not None else ""
                pub_date = f"{year}-{month}-{day}".strip("-")
                mesh_terms = []
                mesh_heading_list = article.findall(".//MeshHeading")
                for mesh_heading in mesh_heading_list:
                    descriptor = mesh_heading.find("DescriptorName")
                    if descriptor is not None:
                        mesh_terms.append(descriptor.text)
                articles.append({
                    'pmid': pmid,
                    'title': title,
                    'pub_date': pub_date,
                    'mesh_terms': mesh_terms
                })
        else:
            print(f"Error: Unable to fetch details for PMIDs: {batch_pmids}")
    return articles

def save_as_json(data, filename):
    with open(filename, "w") as json_file:
        json.dump(data, json_file, indent = 4)

def load_json_to_dataframe(json_filename):
    with open(json_filename, "r") as json_file:
        data = json.load(json_file)
    df = pd.DataFrame(data)
    return df

More on Data Management and Parsing¶

You may notice that we import XML ElementTree, which is because that is the default return type of the data from the PubMed API. I prefer working with JSON, so by parsing the XML output into strings and then saving them into a list, we can easily save the article details as JSON.

Another way that we add redundancy to the data collection is by fetching articles on a per-month basis and storing them in seperate JSON files. We do this not only to allow for plotting of article topics over time but also because it helps with ensuring the API is not overwhelmed with requests. For right now, we are initially collecting 1000 articles per month on genetic disease, however you may notice that these parameters are easily adjustable, and with only a few small tweaks this project could be generalized to many other fields. Just be aware, increasing the number of articles fetched per month may cause errors if there isn't enough articles to populate the PMID list and is guaranteed to increase the time it takes to collect the data.

We use datetime and timedelta to loop through the months, collecting articles from 2023-01-01 to 2023-12-31 one month at a time. Then, we take the master list of all article details and save it to a JSON as well.

In [4]:
# The search term for the esearch request
search_term = 'genetic disease'
# Number of articles to return
retmax = 1000
api_key = "cb3fe1ce8ab6d94d3cdac536509137619908"
start_date = datetime(2023, 1, 1)  # Start date at the beginning of the year
all_articles = []

# Fetch and store data monthly
for i in range(12):  # Loop over 12 months
    month_start = start_date.replace(day=1) + timedelta(days=i*31)
    last_day = calendar.monthrange(month_start.year, month_start.month)[1]
    month_end = month_start.replace(day=last_day)
    month_start_str = month_start.strftime('%Y/%m/%d')
    month_end_str = month_end.strftime('%Y/%m/%d')
    pmid_list = search_articles(search_term, retmax, api_key, month_start_str, month_end_str)
    if pmid_list:
        articles = fetch_article_details(pmid_list, api_key)
        filename = f'articles_{month_start_str.replace("/", "-")}_to_{month_end_str.replace("/", "-")}.json'
        save_as_json(articles, filename)
        all_articles.extend(articles)
    else:
        print(f"No articles found for {month_start_str} to {month_end_str}")

save_as_json(all_articles, "all_genetic_disease_articles.json")

Further Data Handling¶

There are a few issues wih the data we've collected so far. Firstly, some articles are entirely missing MeSH terms, which are crucial to our analysis. We will drop these articles. Next, we will make the titles and mesh_terms lowercase for all articles. This will allow for data homogeneity and ease of manipulation. Our total number of articles goes from 12,000 to around 7,500 here!

You may be wondering why we can't just populate the mesh_terms for a given article based on the title or by performing analysis of the abstract. Well, it can be done but I believe it is outside the scope of this project and introduces far too much computational complexity and gray-area for "forcing" desired outcomes. Previously, I created a list of all MeSH terms contained in the DF and then populated the empty MeSH lists by seeing if any words in the title matched. This seemed like a good idea at first, by expanding our available data we can perform more complete analysis, but in reality it ended up doing more harm than good since it increased the dominance of terms that are not useful for topic prediction.

This is where I was using the NLP libraries, I would take the article titles and MeSH terms and lemmatize them, trying to retain maximal meaning in as few words as possible. Then I would populate the terms lists and try to perform my analysis. From what I said earlier, I noticed that words like "human" and "animal" as well as descriptors of study methods and patient age became dominant and reduced the insight potential of the data.

In [5]:
df = load_json_to_dataframe("all_genetic_disease_articles.json")
df = df.dropna()
df['mesh_terms'] = df['mesh_terms'].apply(lambda terms: [term.lower() for term in terms] if terms else [])
df['title'] = df['title'].str.lower()
df = df[df['mesh_terms'].map(len) > 0]
df = df.reset_index(drop = True)

Converting to datetime¶

Now we convert the publication dates to datetime objects to allow us to plot them over time and eventually to try and fit a classification algorithm to them. This requires some parsing because there is inconsistent formatting in the pub_date strings

In [6]:
def convert_date(date_str):
    try:
        return pd.to_datetime(date_str, format='%Y-%b-%d', errors='raise')
    except ValueError:
        try:
            return pd.to_datetime(date_str, format='%Y-%b', errors='raise')
        except ValueError:
            return pd.to_datetime(date_str, format='%Y', errors='coerce')

df['pub_date'] = df['pub_date'].apply(convert_date)

Exploratory Data Analysis¶

Now we are going to start trying to visualize our data. There are a few ways we'll do this, including a simple bar graph, a heatmap of terms that frequently appear together, and a word cloud visualization of a simple KMeans clustering.

Bar Graph¶

This is a simple bar graph that shows the top 50 MeSH terms by number of occurances in our dataset. You may notice that common words include things like humans, animals, female, male, adult, child etc. These are minimally useful in understanding the actual topic of the paper, so they will not be included in the visualizations after this one.

In [7]:
# Count the frequency of each MeSH term
mesh_term_counts = df.explode('mesh_terms')['mesh_terms'].value_counts().head(50)

# Plot the vertical bar chart
plt.figure(figsize=(12, 8))
mesh_term_counts.plot(kind='bar', color='skyblue')
plt.title('Top 50 MeSH Terms')
plt.xlabel('MeSH Terms')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
No description has been provided for this image

Co Occurence Matrix¶

In the cell below we create a co occurence matrix for MeSH terms in the data. This will count the number of times that words appear together. It is computationally expensive to compute, since it's a big matrix and we are iterating through it multiple times. We will use it for our heatmap later.

In [8]:
all_terms = set(term for sublist in df['mesh_terms'] for term in sublist)
all_terms_list = list(all_terms)  # Covert set to list

# Create a DataFrame where index and columns are the MeSH terms
co_occurrence_matrix = pd.DataFrame(0, index=all_terms_list, columns=all_terms_list)

for terms in df['mesh_terms']:
    for term1 in terms:
        for term2 in terms:
            if term1 != term2:
                co_occurrence_matrix.at[term1, term2] += 1

Heatmap Visualization of Co-Occuring Terms¶

Below we use seaborn's heatmap to visualize frequently co-occuring terms. We drop things that are bound to appear frequently and add little to the data as we've stated before, included in the excluded_terms list. This will have to be updated with domain knowledge / further exploratory data analysis if you wish to further refine the insights provided here or change the topic that the analysis is being performed on. While these are very general terms, you may notice things like "mutation" are left included, because while that is a common term in the study of genetic disease, it does convey meaning in context.

We also perform normalize the data here, using z-score normalization to allow for easier comparison of terms, and to handle other potentially uninsightful terms appearing too often.

In [9]:
# List of terms to exclude
excluded_terms = {"humans", "animals", "male", "female", "adult", "child", "adolescent", "middle aged", "infant, newborn", "animal", "mice"}

# Filter the co-occurrence matrix to remove excluded terms
filtered_matrix = co_occurrence_matrix.drop(index=excluded_terms, columns=excluded_terms, errors='ignore')

if filtered_matrix.empty:
    print("Filtered matrix is empty. Check the excluded terms or initial data.")
    
mean = filtered_matrix.mean()
std = filtered_matrix.std()

# Prevent division by zero by adding a small number to std if it's zero
std = std.replace(0, 0.0001)

# Apply z-score normalization
normalized_matrix = (filtered_matrix - mean) / std

term_frequencies = normalized_matrix.sum(axis=1).sort_values(ascending=False)

top_terms = term_frequencies.head(10).index.tolist()

co_occurrence_strength = normalized_matrix.sum(axis=0).sort_values(ascending=False)

top_co_occurring_terms = co_occurrence_strength.head(10).index.tolist()

relevant_terms = list(set(top_terms).union(set(top_co_occurring_terms)))

focused_matrix = normalized_matrix.loc[relevant_terms, relevant_terms]

# Plotting the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(focused_matrix, cmap='coolwarm', annot=False, center=0)
plt.title('Heatmap of Normalized Selected MeSH Terms Co-occurrence')
plt.show()
No description has been provided for this image

List of top 25 most correlated terms¶

Here we print out a list of the top 25 most correlated terms. We filter out redundant pairs to avoid pairs of the form (x, y) apearing again as (y, x). As you can see, we are getting a bit more insight now into what subject headings appear most commonly together. We can use this to help us come up with cluster labels for a KMeans clustering algorithm and more generally to get an idea of what we can expect to see in each cluster.

In [10]:
flat_matrix = focused_matrix.stack()

# Filter out redundant pairs by ensuring the first term is lexicographically smaller than the second
unique_pairs = pd.Series({(min(term1, term2), max(term1, term2)): value 
                          for (term1, term2), value in flat_matrix.items()})

top_values = unique_pairs.sort_values(ascending=False).head(25)

# Print the top 25 values with their corresponding term pairs
for (term1, term2), value in top_values.items():
    print(f"({term1}, {term2}): {value:.2f}")
(genetic predisposition to disease, polymorphism, single nucleotide): 61.83
(mutation, phenotype): 60.32
(genetic predisposition to disease, genome-wide association study): 41.98
(genotype, vero cells): 32.58
(genotype, polymorphism, single nucleotide): 32.07
(genetic predisposition to disease, genotype): 31.64
(genotype, phenotype): 30.43
(genome-wide association study, polymorphism, single nucleotide): 29.94
(nippostrongylus, strongylida infections): 27.25
(mutation, retrospective studies): 21.07
(genotype, mutation): 18.26
(genotype, pregnancy): 17.03
(pregnancy, retrospective studies): 15.32
(genetic predisposition to disease, phenotype): 14.44
(mutation, pregnancy): 13.59
(dermatitis, atopic, retrospective studies): 13.01
(disease models, animal, mutation): 12.46
(disease models, animal, phenotype): 12.46
(dermatitis, atopic, disease models, animal): 11.49
(genetic predisposition to disease, retrospective studies): 11.29
(phenotype, retrospective studies): 10.14
(genotype, retrospective studies): 8.41
(genetic predisposition to disease, mutation): 8.31
(genotype, interferons): 8.13
(genome-wide association study, phenotype): 7.68

Continued EDA, KMeans clustering¶

Here we perform KMeans clustering on the MeSH terms for use in visualizing the common major topic areas and coming up with a hypothesis for testing. We define custom stop words for use with our vectorizer to extract maximal insight potential. We can vary the number of clusters to increase / decrease the scope of each cluster topic.

In [11]:
csw = [
    "humans", "animals", "male", "female", "adult", "child", "adolescent", "disease", "diseases", "animal", "model", "factors", "factor",
    "middle", "aged", "infant", "newborn", "to", "of", "and", "study", "studies", "syndrome", "models", "retrospective", "type"
]
vectorizer = TfidfVectorizer(stop_words=csw)
df['cluster_terms'] = df['mesh_terms'].apply(' '.join)
X = vectorizer.fit_transform(df['cluster_terms'])

# Number of clusters
k = 12

# Initializing KMeans
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

# Cluster labels
cluster_labels = kmeans.labels_
df['cluster_id'] = cluster_labels

# Determine the top 5 terms for each cluster
terms = vectorizer.get_feature_names_out()
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

# Create a mapping from cluster ID to the top 5 terms
cluster_terms = {}
for i in range(k):
    top_five_terms = ', '.join([terms[ind] for ind in order_centroids[i, :5]])
    cluster_terms[i] = top_five_terms

# Map the descriptive labels to the cluster IDs in the dataframe
df['cluster_label'] = df['cluster_id'].map(cluster_terms)

Wordcloud Visualization¶

Here we are taking each cluster from the KMeans clustering we performed on the MeSH terms and printing out the wordcloud representation of the most commonly occuring terms in that cluster. This helps us get an idea of what our cluster labels should look like, which we will be using to group articles for our next step, where we plot them over time.

In [12]:
feature_names = vectorizer.get_feature_names_out()
cluster_centers = kmeans.cluster_centers_

# Generating word clouds for each cluster
for i in range(k):
    # Extract the cluster center values and map them to the corresponding words
    terms_scores = {feature_names[j]: cluster_centers[i, j] for j in range(len(feature_names)) if cluster_centers[i, j] > 0}
    
    # Create the word cloud
    wordcloud = WordCloud(background_color='white', max_words=50).generate_from_frequencies(terms_scores)
    
    # Plotting
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f'Word Cloud for Cluster {i}')
    plt.axis('off')
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Last of the EDA¶

Graph of Cluster Size over Time¶

Here is the culmination of our exploratory data analysis, a plot of the number of articles per cluster over time for 2023-2024. We exlude January 2023 here since any articles that had a publication date of the form yyyy were assigned 2023, and it led to skewed data. This plot indicates that there are some areas of research that are overly dominant, especially those involving mutations, pregnancy, and covid-19. There is some more nuance that could be added here, like additionally comparing the rate of growth of the clusters month over month as opposed to the raw magnitude of articles. I also think that this concept would work better with a much larger and more comprehensive dataset, maybe on the magnitude 10-100x the size and spanning a larger timeframe. Additionally, more tuning of the stopwords could prevent oversimplification of topics, and lead to a better representation.

The next step from here is to use the insight and data we've gained to perform some hypothesis testing and machine learning!

In [13]:
# Filter the DataFrame for the desired date range
df_filtered = df[(df['pub_date'] >= '2023-02-01') & (df['pub_date'] < '2024-01-01')]
df_filtered = df_filtered.reset_index(drop=True)

df_filtered.set_index('pub_date', inplace=True)

grouped = df_filtered.groupby('cluster_id').resample('ME').size()

fig, ax = plt.subplots(figsize=(15, 9))
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'orange']  # Ensure there are enough colors for each cluster

# Plot data for each cluster
for cluster_id, data in grouped.groupby(level=0):  # groupby level=0 groups by cluster_id
    label = cluster_terms.get(cluster_id, f'Cluster {cluster_id}')
    data = data.reset_index(level=0, drop=True)  # Drop the 'cluster_id' level
    data.plot(ax=ax, label=label, color=colors[cluster_id % len(colors)])

# Set plot titles and labels
ax.set_title('Number of Articles by Cluster Terms Over Time')
ax.set_xlabel('Date')
ax.set_ylabel('Number of Articles')
ax.legend(title='Cluster Terms', loc='upper right')

# Set the x-axis date labels at a 45-degree angle
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()
No description has been provided for this image

Hypothesis Testing¶

Contingency Table and Chi Square testing¶

Here we construct a contingency table that displays the frequency distribution of the most commonly occuring MeSH terms. It captures how often pairs of MeSH terms co-occur within the same article. Similar to the co-occurrence matrix. It allows for statistical testing, which we will be performing a chi-square test of independence. We want to determiine whether there is a statistically significant assosciation between different MeSH terms.

Null Hypothesis: The MeSH terms appear independently and are simply appearing together due to random chance

We run the test and conclude that there is exceedingly strong evidence that we should reject the null hypothesis. It is clear that MeSH terms that appear together frequently are highly correlated

In [14]:
# Create a contingency table for the top 10 MeSH terms
top_mesh_terms = mesh_term_counts.head(10).index.tolist()
contingency_table = pd.DataFrame(0, index=top_mesh_terms, columns=top_mesh_terms)

for terms in df['mesh_terms']:
    for term1 in terms:
        if term1 in top_mesh_terms:
            for term2 in terms:
                if term2 in top_mesh_terms and term1 != term2:
                    contingency_table.at[term1, term2] += 1

# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-squared: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")

if p < 0.05:
    print("Reject null hypothesis: Certain MeSH terms co-occur more frequently than would be expected by chance.")
else:
    print("Fail to reject null hypothesis: MeSH terms do not co-occur more frequently than would be expected by chance.")
Chi-squared: 15238.23631882308
P-value: 0.0
Degrees of freedom: 81
Reject null hypothesis: Certain MeSH terms co-occur more frequently than would be expected by chance.

Additional Chi Square testing¶

In this test we want to determine whether the proportion of articles in each cluster changes over time. We compare the amount of data in the first month and it's distribution among clusters to the last month and conclude that indeed there is a shift in the proportion of articles, however this conclusion isn't as strong as our last hypothesis, as the P-Value is much higher.

In [15]:
# Extract data for first and last month
first_month = df[df['pub_date'].dt.month == 1]
last_month = df[df['pub_date'].dt.month == 12]

# Count articles in each cluster
first_month_counts = first_month['cluster_id'].value_counts().sort_index()
last_month_counts = last_month['cluster_id'].value_counts().sort_index()

contingency_table = pd.DataFrame({'first_month': first_month_counts, 'last_month': last_month_counts}).fillna(0)

# Perform chi-square
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-squared: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")

if p < 0.05:
    print("Reject null hypothesis: The proportion of articles in different clusters has changed over the year.")
else:
    print("Fail to reject null hypothesis: The proportion of articles in different clusters has not changed significantly over the year.")
Chi-squared: 21.640078971459253
P-value: 0.027314643925154484
Degrees of freedom: 11
Reject null hypothesis: The proportion of articles in different clusters has changed over the year.

Machine Learning¶

Logistic Regression for Prediction of Cluster¶

Now that we have a nice spread of information at our disposal, we can use machine learning tools to create predictions about the data moving forward. In this first application, we implement logistic regression in an attempt to provide a method of classification for an article given it's MeSH terms. The current model has a 92% classification accuracy for the 12 clusters we currently have. Looking at the feature importance list, which displays the weights of the terms, It's interesting but unsuprising that "cystic" and "fibrosis" are at the top. These two terms are some of the most frequently appearing in their cluster, and nearly always appear together. This makes them unsuprisingly valuable in cluster prediction.

In [16]:
# Prepare features and labels
vectorizer = TfidfVectorizer(stop_words=csw)
df['cluster_terms'] = df['mesh_terms'].apply(' '.join)
X = vectorizer.fit_transform(df['cluster_terms'])
y = df['cluster_id']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

# Show the importance of features (terms)
feature_importance = pd.Series(clf.coef_[0], index=vectorizer.get_feature_names_out()).sort_values(ascending=False)
print(feature_importance.head(10))
Accuracy: 0.9221635883905013
              precision    recall  f1-score   support

           0       1.00      0.86      0.92        42
           1       0.88      1.00      0.94       705
           2       1.00      0.92      0.96        62
           3       0.95      0.84      0.89        99
           4       1.00      0.74      0.85        70
           5       0.97      0.81      0.88       122
           6       0.91      0.84      0.88       171
           7       1.00      0.94      0.97        34
           8       1.00      0.81      0.90        32
           9       1.00      0.73      0.84        26
          10       0.98      0.97      0.98       108
          11       1.00      0.89      0.94        45

    accuracy                           0.92      1516
   macro avg       0.97      0.86      0.91      1516
weighted avg       0.93      0.92      0.92      1516

cystic           7.959856
fibrosis         7.579474
lung             1.531598
pseudomonas      1.413329
transmembrane    0.914473
conductance      0.911853
anti             0.884677
regulator        0.881329
aeruginosa       0.850212
bacterial        0.807151
dtype: float64

Linear Regression Model for Predicting Number of Articles Published Next Month¶

Below is a simple linear regeression model that takes the data about the number of articles published per month per cluster and predicts how many articles will be published per cluster next month. Again, this model suffers due to the relatively low number of clusters and observations, as well as the amount of history it has to train on. However, we can see the predictions for the next month for each cluster which is valuable in forecasting growth or decline in areas of study. This leads to the last part of the tutorial, which is the takeaway.

In [17]:
# Predict the number of articles for each cluster in the next month
predictions = {}

# Iterate over
for cluster_id, data in grouped.groupby(level=0):
    # Drop the 'cluster_id' level
    data = data.droplevel(0)
    
    # Prepare the data for linear regression
    X = np.arange(len(data)).reshape(-1, 1)  
    y = data.values 
    
    # Fit the Linear Regression model
    model = LinearRegression()
    model.fit(X, y)
    
    # Predict the next month's value
    next_month = np.array([[len(data)]])
    forecast = model.predict(next_month)
    predictions[cluster_id] = forecast[0]

# Display the predictions
for cluster_id, forecast in predictions.items():
    print(f"Cluster {cluster_id}: Predicted number of articles for the next month: {forecast:.2f}")
Cluster 0: Predicted number of articles for the next month: 18.16
Cluster 1: Predicted number of articles for the next month: 285.58
Cluster 2: Predicted number of articles for the next month: 24.15
Cluster 3: Predicted number of articles for the next month: 45.33
Cluster 4: Predicted number of articles for the next month: 21.85
Cluster 5: Predicted number of articles for the next month: 49.80
Cluster 6: Predicted number of articles for the next month: 80.04
Cluster 7: Predicted number of articles for the next month: 11.60
Cluster 8: Predicted number of articles for the next month: 17.64
Cluster 9: Predicted number of articles for the next month: 10.42
Cluster 10: Predicted number of articles for the next month: 41.22
Cluster 11: Predicted number of articles for the next month: 14.78

Takeaways / Message¶

Predictive Value¶

The insight that this project provides could be valuable in determining which areas of study are most rapidly growing, which could encourage policy or funding decisions. It could also indicate which areas appear less in publications, which might lead to new research developments in less popular areas.

Areas for improvement¶

Like I've touched on previously, there is a lot of room for improvement in my project. If I was to expand this idea, I would first contact the National Library of Medicine to increase my API requests / second limit. I would collect a much more comprehensive and larger dataset, spanning a much larger timeframe. More sophisticated methods of understanding the topic of an article could be put in place, expanding the diversity and accuracy of the clusters. Performance could also be improved by leveraging the GPU or parallelism. Overall, this was a very fun tutorial to make and I hope it helped you understand the basics of the Data Science Pipeline!